Learn the five-number summary and compare it to a box plot.
Note on Accessibility:
Data visualization is the production of charts and graphs that reveal trends and interrelationships to the eye. Although it can be done well by and for low-vision and colorblind users, dataviz is fundamentally visual, so it’s problematic for the blind. Data methods by and for the blind should strongly consider other techniques. Nevertheless, it’s important for both blind and sighted students of statistics to understand the larger discipline. I’ve tried to make these notes useful to everyone, understanding that a histogram might not be.
The Scatter Plot
A scatterplot is a visualization of two quantitative variables, appropriate when at least two numerical values are recorded for each data case. The \(x\)-axis spans the range of one variable and the \(y\)-axis the range of the other. Each data case is then drawn as one point \((x,y)\) according to its values for the two variables. A cloud of points arises, and its shape is a clue to any association between the variables.
The Scatter Plot – Example
import pandas as pd
import seaborn as sns
df = pd.read_csv("county.csv")
sns.scatterplot(data=df, x='homeownership', y='multi_unit')
Figure 1: Scatterplot Comparing homeownership to multi-unit dwelling by U.S. county
Of the three, which species is easiest to identify? How is it recognized?
What’s the best way to distinguish between the other two species?
Do flowers with wider petals usually have wider sepals too?
There are fewer dots on this scatterplot. What does that mean about the flower data?
Figure 3: Scatterplot comparing petal width to sepal width for Fisher’s irises.
What’s a Box (-and-whisker) Plot?
A visualization of the distribution of one quantitative variable, an alternative to a histogram.
reveals the full range of data values along the \(x\)-axis.
divides the data range into four “equal” parts, called quartiles
the quartiles usually have unequal width, but
each quartile’s size is adjusted to include exactly a quarter of the data points.
the inner two quartiles are drawn as a box, and the outer two are drawn as “whiskers”
The five quartile boundary points are called: Min, 25%, 50%, 75%, and Max.
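To make the five boundary points concrete, here is a minimal sketch that computes them by hand. The nine values are made up for illustration (they are not from county.csv), and the "inclusive" quartile method is one common convention among several; other software may give slightly different 25% and 75% values on the same data.

```python
import statistics

# Hypothetical percentages for nine data cases (made-up values).
values = sorted([4, 7, 9, 10, 12, 15, 18, 22, 35])

# The three inner boundaries: 25%, 50% (the median), and 75%.
q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")

# The outer boundaries are simply the smallest and largest values.
minimum, maximum = values[0], values[-1]

print("Min:", minimum)
print("25%:", q1)
print("50%:", median)
print("75%:", q3)
print("Max:", maximum)
```

With a dataframe loaded, df['poverty'].describe() reports the same five values (labeled min, 25%, 50%, 75%, max) alongside the count, mean, and standard deviation.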
Box (-and-whisker) plot – Basic Example
sns.boxplot(data=df, x='poverty')
Based on the boxplot, what is a typical poverty percent for U.S. counties?
For what x-regions are the data points tightly clustered? Where are they more thinly spread?
Figure 4: Boxplot for poverty percent by county.
Whisker Technicalities
Customarily, whiskers aren’t allowed to be more than 1.5 times as long as the box. If a whisker would be longer than that, it is trimmed to at most 1.5 * [box size], and data beyond it are drawn as individual dots. Both Python and OpenIntro do this. You need to know this to answer questions like “what’s the maximum data value?” using a boxplot: the maximum may be the farthest dot rather than the tip of the whisker.
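The 1.5 rule can be worked through by hand. This is a sketch on made-up values, assuming the common plotting convention that each whisker stops at the most extreme data point that still lies within 1.5 box lengths of the box; the variable names are my own.

```python
import statistics

# Hypothetical percentages for nine data cases (made-up values).
values = sorted([4, 7, 9, 10, 12, 15, 18, 22, 35])

# Box edges are the 25% and 75% boundary points.
q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
box_size = q3 - q1

# A whisker may reach at most 1.5 box lengths past each box edge.
low_limit = q1 - 1.5 * box_size
high_limit = q3 + 1.5 * box_size

# Each whisker ends at the farthest data point inside the limits.
inside = [v for v in values if low_limit <= v <= high_limit]
whisker_low, whisker_high = min(inside), max(inside)

# Anything outside the limits is drawn as an individual dot.
dots = [v for v in values if v < low_limit or v > high_limit]

print("Whiskers run from", whisker_low, "to", whisker_high)
print("Drawn as dots:", dots)
print("True maximum:", max(values))
```

Here the right whisker stops at 22 while the true maximum, 35, appears only as a dot, which is why the whisker tip alone does not answer "what's the maximum data value?"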
Box (-and-whisker) plot – Rich Example
Since each boxplot is so narrow, several can be stacked together to compare related distributions across categories:
sns.boxplot(data=df, x='Attack', y='Type')
Figure 5: Boxplot for Attack value of various pokemon, separated by type.
The Box Plot – more serious example
sns.boxplot(data = df, x ='salary', y ='team')
The plots seem left-justified. What does that mean about salaries?
Which teams pay the least? the most?
What do the circles mean?
Which team pays the highest median wage?
Professor Howald is considering quitting math and playing major league ball. What’s a realistic salary expectation?
Figure 6: Boxplot for MLB salary by team.
Plotting yourself
Once a dataframe is loaded, it’s not hard to make a scatterplot or boxplot:
import pandas as pd  # Needed once, not for every plot
import seaborn as sns  # Needed once, not for every plot
df = pd.read_csv("filename.csv")
sns.boxplot(data=df, x='Attack', y='Type')
sns.scatterplot(data=df, x='homeownership', y='multi_unit')
See examples illustrated on previous slides.
Summary: Which plot type is best for each case?
To illustrate how time spent studying relates to course grade.
To show any relationship between religious identity and GPA.
To illustrate a link between height and weight.
To show the distribution of molecule sizes in a polymer.
To illustrate total sales by product type.
To visualize the masses and temperatures of thousands of stars.
To show whether pokemon with higher attack also have higher defense.
Putting it all together
Load the OpenIntro Run17 data. Make your own plots to answer each question.
What genders are represented and how many of each?
How are the run times distributed?
Are the run times unimodal? Bimodal? Something else? Why!?